Abstract

Relational association rules reveal patterns hidden in multiple tables. Existing
rules are usually evaluated through two measures, namely support and confidence.
However, these two measures may not be enough to describe the strength
of rules. In this paper, we introduce granular association rules with four
measures to reveal connections between concepts in two universes, and propose 
three algorithms for rule mining. Two examples of such associations might be 
"men like alcohol" and "young men like France alcohol." With four measures, 
namely source coverage, target coverage, source confidence and target confi- 
dence, our rules are semantically richer than existing ones. Three subtypes of 
rules are obtained through considering special requirements on the source con- 
fidence and the target confidence. Then we define a rule mining problem, and 
design a sandwich algorithm with different rule checking approaches for different 
subtypes. Experiments on a real world dataset show that the approaches dedicated
to three subtypes are 2-3 orders of magnitude faster than the one for the
general case. Moreover, a forward algorithm and a backward algorithm for one 
particular subtype can speed up the mining process further. This work opens 
a new research trend concerning relational association rule mining, granular 
computing and rough sets. 

Keywords: Granular computing, relational association rule, measure, concept, 
complete match, partial match. 



1. Introduction 

Relational data mining approaches [10] [11] look for patterns that involve
multiple tables in the database. Important issues include relational association 
rule discovery (see, e.g., [J [HI [HI HSl 1111 US] ) , relational decision trees (see, e.g., 
[5] [55]), and relational distance-based learning (see, e.g., [H]). These issues are 
undoubtedly more general and more challenging than their counterparts on a
single data table. Therefore they have become popular in recent years.

People have proposed various types of relational association rules for dif- 
ferent applications. For example, Dehaspe et al. chained binary relations 
to produce ternary relations, quaternary relations, etc, and then constructed 
rules from new relations. Jensen et al. [19] joined a number of primary tables
through the central relationship table, then constructed rules from the new table.
Goethals et al. [14] constructed rules from two queries, where one asks
for a set of tuples satisfying a certain condition, and the other asks for those 
tuples satisfying a more specific condition. Kavurucu et al. [H] considered one- 
to-many relationships, and induced logical patterns valid for given background 
knowledge through Inductive Logic Programming (ILP). Goethals et al. [13]
also constructed rules from frequent itemsets across entities and binary relations,
with a key specified such that the occurrences of itemsets are counted in
one entity table. 

These rules are usually evaluated through two measures, namely support and
confidence, which are well defined for association rules [3] [36] in a single data
table. Unfortunately, these two measures may not be enough to describe the
strength of relational association rules. For example, according to [13] we may
obtain a rule "75% female professors teach courses with 10 credits, among 30% 
of all courses." In fact, a professor may teach only one course with 10 credits, 
or she may teach all courses with 10 credits. Neither measure distinguishes this 
kind of difference. 

In this paper, we introduce granular association rules with four measures 
to reveal connections between concepts in two universes. The term "granular" 
comes from granular computing |25J HHl IIH SI] , which is an emerging concep- 
tual and computing paradigm of information processing [S]. It indicates that 
concepts can take any granule specified by an attribute subset. Let us consider a 
database with two entities customer and product connected by a relation buys. 
Examples of granular association rules include "men like alcohol," "young men 
like France alcohol," and "Chinese women like white stuff." The second rule 
has a finer granule than the first one since young men is a subset of men, and 
France alcohol is a subset of alcohol. However, the third rule is neither finer nor
coarser than the second one because the left part of the second rule is concerned
with age and gender, while the left part of the third one is concerned
with country and gender. Naturally, a direct application of this type of rules is 
product recommendation, often referred to as collaborative recommendation [1] 
or collaborative filtering [TS]. 

We propose four measures to evaluate the quality of a granular association 
rule. An example of such a rule might be "40% men like at least 30% kinds 
of alcohol; 45% customers are men and 6% products are alcohol." Here 45%, 
6%, 40%, and 30% are the source coverage, the target coverage, the source 
confidence, and the target confidence, respectively. The support measure, which 
is well defined for other association rules, is redundant since it is equal to the 
product of the source coverage and the source confidence. With these four 
measures, the strength of the rule is well defined. This is one reason why the
new type of rules is semantically richer than most existing ones. Another reason
is that the new type is more specific than some existing ones, which span
more than two universes (see, e.g., [9] [19]) or even the whole database
(see, e.g., [HIH]).

In some cases the source confidence and/or the target confidence might be 
100%, resulting in three subtypes with some properties. When the source con- 
fidence is 100%, the rule is called a right-hand side partial match one. When 
the target confidence is 100%, the rule is called a left-hand side partial match 
one. When both measures are 100%, the rule is called a complete match one. 
In correspondence with these terms, when neither measure is 100%, the rule is 
called a partial match one. We may also view partial match rules as a general 
case without requirements on the source confidence and the target confidence. 

Our objective is to mine all granular association rules satisfying thresholds 
of four measures. We design a sandwich rule mining algorithm for this pur- 
pose. With this algorithm, candidate concepts are generated in each universe 
according to the source coverage and target coverage thresholds using existing 
algorithms such as Apriori [3] or FP-growth [T^. Then candidate rules are 
generated and checked. Rules meeting the source confidence and the target 
confidence thresholds are output. The rule checking approach for partial match 
rules is inefficient for other subtypes. Therefore we design different rule checking 
approaches for three subtypes to fully take advantage of their characteristics. 
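
The pipeline described above can be illustrated with a minimal brute-force sketch. It is not the sandwich algorithm itself: the hypothetical helper `frequent_concepts` enumerates every attribute-value conjunction directly instead of calling Apriori or FP-growth, the rule check is the general (partial match) one, and the toy data are transcribed from Tables 1-3 of Section 3 with only two attributes per entity. All function and variable names are assumptions made for illustration.

```python
from itertools import combinations

# Toy data transcribed from Tables 1-3 (two attributes per entity for brevity).
CUSTOMERS = {
    "c1": {"Gender": "Male", "Country": "USA"},
    "c2": {"Gender": "Female", "Country": "USA"},
    "c3": {"Gender": "Male", "Country": "China"},
    "c4": {"Gender": "Female", "Country": "Japan"},
    "c5": {"Gender": "Male", "Country": "China"},
}
PRODUCTS = {
    "p1": {"Category": "Staple", "Country": "Australia"},
    "p2": {"Category": "Daily", "Country": "China"},
    "p3": {"Category": "Meat", "Country": "China"},
    "p4": {"Category": "Meat", "Country": "Australia"},
    "p5": {"Category": "Alcohol", "Country": "France"},
    "p6": {"Category": "Alcohol", "Country": "France"},
}
BUYS = {
    "c1": {"p1", "p2", "p4", "p5"},
    "c2": {"p1", "p4", "p6"},
    "c3": {"p2", "p3", "p5", "p6"},
    "c4": {"p2", "p4", "p5"},
    "c5": {"p1", "p3", "p4", "p5", "p6"},
}


def frequent_concepts(table, min_coverage):
    """Brute-force stand-in for Apriori/FP-growth (hypothetical helper).

    Returns {intension: extension} for every attribute-value conjunction
    whose extension covers at least min_coverage of the universe.
    """
    attrs = sorted(next(iter(table.values())))
    found = {}
    for size in range(1, len(attrs) + 1):
        for subset in combinations(attrs, size):
            for row in table.values():
                intension = tuple((a, row[a]) for a in subset)
                extension = frozenset(
                    obj for obj, other in table.items()
                    if all(other[a] == v for a, v in intension)
                )
                if len(extension) / len(table) >= min_coverage:
                    found[intension] = extension
    return found


def mine_rules(min_scov, min_tcov, min_sconf, min_tconf):
    """Generate candidate concept pairs, then keep those passing both confidences."""
    sources = frequent_concepts(CUSTOMERS, min_scov)
    targets = frequent_concepts(PRODUCTS, min_tcov)
    rules = []
    for lhs, lh in sources.items():
        for rhs, rh in targets.items():
            # Customers in LH who buy at least min_tconf of the products in RH.
            matched = [x for x in lh if len(BUYS[x] & rh) / len(rh) >= min_tconf]
            if len(matched) / len(lh) >= min_sconf:
                rules.append((lhs, rhs))
    return rules


# With both confidence thresholds at 100%, only complete match rules survive.
for lhs, rhs in mine_rules(0.4, 0.3, 1.0, 1.0):
    print(lhs, "=>", rhs)
```

Checking every customer of every candidate pair, as done here, is the expensive general-case check; the dedicated checks for the three subtypes exist precisely to avoid it.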

We also design two more algorithms to mine complete matching rules. They 
are called the forward algorithm and the backward algorithm, respectively. 
Lower approximation, which is a key concept in rough sets [33], is employed
to analyze both algorithms. Hence granular association rule mining can be
viewed as a new application of rough sets.

Experiments are undertaken on the course selection data from Zhangzhou 
Normal University. Some interesting rules are obtained through setting reason- 
able thresholds of four measures. The efficiencies of different approaches are 
compared through different settings on four thresholds. For the sandwich algo- 
rithm, rule checking approaches designed for three subtypes are 2-3 orders of 
magnitude faster than the one for the general case. Moreover, a forward algo- 
rithm and a backward algorithm, which are valid for complete matching, can 
enhance the performance further. 

The rest of the paper is organized as follows. Section 2 reviews three types of
classical association rules and five types of relational association rules. Section
3 defines the data model for granular association rules and three subtypes of
rules. Then Section 4 defines the problem and presents a sandwich algorithm for
the problem. A forward algorithm and a backward algorithm are also designed
to mine complete match rules. Experiments on the course selection data are
discussed in Section 5. Finally, Section 6 presents the concluding remarks and
further research directions.

2. Related works

In this section, we review popular association rule mining problems and 
respective approaches. We will begin with association rules in a single data 
table, and then proceed to association rules involving multiple tables. 

2.1. Association rules 

Association rules on a single data table have been well-studied. These are 
boolean association rules, quantitative association rules, and multi-level associ- 
ation rules. 

2.1.1. Boolean association rules 

The concept of association rule was first introduced in [2] to mine transaction
data of a supermarket. This concept was renamed as boolean association rule
[36] to distinguish it from other types of association rules. The transaction data,
also called the basket data, store items purchased on a per-transaction basis. 
An example of such a rule is "30% of transactions that contain beer also contain
diapers; 2% of all transactions contain both of these items." Here 30% and 2%
are the confidence and the support, respectively, of the rule.

From the set point of view, boolean association rules reveal the connection 
between two disjoint subsets of the same universe. If the number of transactions
is n and the number of items is m, the basket data can be stored in
an information table with n rows and m columns. Each datum in the data
table is boolean, specifying whether or not an item is included in the respective
transaction. This is why the rules are called boolean association rules.
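
To make the two classical measures concrete, the following minimal sketch computes the support and the confidence of a boolean association rule from a small boolean table. The basket data, the item names and the helper functions are hypothetical and chosen only for illustration.

```python
# Hypothetical basket data: one row per transaction, one boolean column per item.
ITEMS = ["beer", "diapers", "milk"]
BASKETS = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
    [0, 0, 1],
]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    cols = [ITEMS.index(item) for item in itemset]
    return sum(all(row[c] for c in cols) for row in BASKETS)


def support(lhs, rhs):
    """Fraction of all transactions containing both sides of the rule."""
    return count(lhs | rhs) / len(BASKETS)


def confidence(lhs, rhs):
    """Fraction of the transactions containing lhs that also contain rhs."""
    return count(lhs | rhs) / count(lhs)


print(support({"beer"}, {"diapers"}))     # 2 of the 5 transactions
print(confidence({"beer"}, {"diapers"}))  # 2 of the 3 beer transactions
```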

The Apriori [2j [3] algorithm is based on the Apriori property j^. It can 
mine all boolean association rules efficiently given the thresholds of support and
confidence. The FP-growth [T^ algorithm avoids candidate generation and 
therefore saves further computation time.

2.1.2. Quantitative association rules 

The quantitative association rule [36] was introduced to cope with data tables
with quantitative attribute values. From the data type point of view, it is a
generalization of the boolean association rule. It reveals the relationships among
attribute values of an object. A well known application is mining information of
people. An example of such a rule is "10% of married people between age 50 and
60 have at least 2 cars; 3% of all people queried satisfy this rule" [36]. Similar
to the case of boolean association rules, here 10% is called the confidence of the 
rule, and 3% is the support of the rule. 

Since the Apriori property still holds in the new context, the Apriori algo- 
rithm can be designed accordingly [SSj. One can also follow the idea of FP- 
growth to design a more efficient algorithm. 

2.1.3. Multi-level association rules

Multi-level association rules [TH] reside at multiple concept levels to discover 
more specific and concrete knowledge from data. In addition to the transaction 
data, it requires a description table to indicate different levels. Suppose that 
category, content and brand represent the first, the second, and the third level 
concept respectively of a food. Two examples of such rules are "75% of people 
buy wheat bread if they buy 2% milk," and "82% of people buy bread if they 
buy 2% milk." However, the rule "60% of people buy products made of wheat if 
they buy 2% milk" is invalid since "products made of wheat" does not indicate 
the category. 

2.2. Relational association rules 

In recent years, multi-relational data mining (MRDM) [10], also called relational
data mining (RDM), has been proposed to look for patterns that involve
multiple tables. Accordingly, the concept of association rule has been extended
in this regard to form relational association rules. There are various extensions,
and we will discuss the more popular ones.

2.2.1. Extended boolean association rules 

Dehaspe et al. [H [H| , Dzeroski et al. [101 E] , and Afrati et al. [T] considered 
the case where binary relations can be chained to produce ternary relations, 
quaternary relations, etc. Suppose there are two binary relations, namely the 
parent-child relation and child-pet relation. A parent-child-pet relation can be 
produced using a SQL query on the database. An example of such a rule is "if a
person has a child, then this child has a pet with a probability of 30%; 20% of 
all people satisfy this rule." Here 30% is called the confidence of the rule, and 
20% is the support of the rule. 

We will call this type of rules extended boolean association rules since they
can be viewed as a direct extension of boolean association rules on a single table.
The information carried by such rules is quite limited. They cannot indicate
the number of children a person has, or the number of pets a child has. Nor can
they specify other information, such as the age of a parent or a child.

Dehaspe et al. [8] designed a general purpose inductive logic programming
algorithm called Warmr to mine this type of rules. Afrati et al. p] also tried 
to attack this problem using integer programming and graph approaches. 

2.2.2. Decentralized association rules 

Jensen et al. [TO] considered the case of decentralized tables. In this case the
database contains n primary tables (i.e., tables with one primary key), and one 
central relationship table (i.e., a table with n foreign keys). An example of such
a rule is "if the ATM type is drive, then the age of the customer is between 20
and 29." The confidence and support measures are computed in the same way as
on the table obtained by joining all n + 1 tables.

We will call this type of rules decentralized association rules. In fact, if 
n = 2, the database represents a many-to-many relation, which is quite typical.
However, in real applications a central relationship table seldom exists for n > 2.
Therefore these rules are valid for very special databases, or parts of a database.

2.2.3. Simple conjunctive association rules 

Goethals et al. [14] considered mining association rules in arbitrary relational
databases. This approach looks for pairs of SQL queries Q1 and Q2, such that
"Q1 asks for a set of tuples satisfying a certain condition and Q2 asks for those
tuples satisfying a more specific condition" [13]. When the number of tuples
matching Q2 is close to that of Q1, a rule is created. An example of such a rule
is "actors starring in 'drama' movies typically (with a probability of 90%) also
star in a 'comedy' movie."

We will call this type of rules simple conjunctive association rules. The 
conjunction here is much more flexible than the case of extended boolean asso- 
ciation rules. In fact, any kind of SQL query is supported. Goethals et al. [14] 
designed the Conqueror algorithm to mine this type of rules. 

2.2.4. ILP-based association rules

Kavurucu et al. [5T] considered one-to-many relationships. They extended 
the background knowledge with aggregate predicates in order to characterize the 
structural information that is stored in tables and association between them. 
In this way, logical patterns valid for given background knowledge are induced 
through Inductive Logic Programming (ILP).

We will call this type of rules ILP-based association rules. Kavurucu et al.
PT] designed a concept discovery system named Confidence-based Concept Discovery
(C²D). C²D does not require user specification of input/output modes of
arguments. Therefore it is suitable for non-expert users without much knowledge
of the semantic details of the relations.

2.2.5. Separated counting association rules 

Goethals et al. [13] also considered a more specific type of association rules.
The frequency of a rule is not counted as the number of occurrences in the join
of tables. Let the database consist of the tables Professor, Course and Student. For
one particular kind of courses, the number of professors who teach them and
the number of students who study them are counted separately. An example of
such a rule is "75% of professors named Jan teach courses with 10 credits, among
30% of all courses." Here 75% is the confidence and 30% is the relative support.

We will call this type of rules separated counting association rules. Unfortu- 
nately, the counting mechanism is not good enough. For example, a professor 
may teach only one course with 10 credits, or she may teach all courses with 10 
credits. This type of rules does not contain such information. 

Existing relational association rule mining works have at least two drawbacks.
First, the general relational association rule mining problem usually involves
the join operation of multiple data tables [19]. When these data tables are
large, it is often infeasible to join more than two of them. Second, as the
association rule becomes more complex in the context of RDM, the support and
confidence measures are not enough to evaluate the strength of the rule.

Table 1: Customer

CID  Name      Age     Gender  Married  Country  Income    NumCars
c1   Ron       20..29  Male    No       USA      60k..69k  0..1
c2   Michelle  20..29  Female  Yes      USA      80k..89k  0..1
c3   Shun      20..29  Male    No       China    40k..49k  0..1
c4   Yamago    30..39  Female  Yes      Japan    80k..89k  2
c5   Wang      30..39  Male    Yes      China    90k..99k  2

3. Granular association rules with three subtypes 

In this section, we will introduce granular association rules to address the 
drawbacks of existing types mentioned in the last section. We will first discuss 
the data model for the new type. Then we present three subtypes of rules 
and one general case corresponding to four different explanations of granular 
association rules. At the same time, a number of measures are proposed to 
evaluate the quality of these rules. A comprehensive comparison with existing 
types will be made at the end of the section. 

3.1. The data model 

First we need to revisit the definitions of information systems and binary 
relations. 

Definition 1. $S = (U, A)$ is an information system, where $U = \{x_1, x_2, \ldots, x_n\}$
is the set of all objects, $A = \{a_1, a_2, \ldots, a_m\}$ is the set of all attributes, and
$a_j(x_i)$ is the value of $x_i$ on attribute $a_j$ for $i \in [1..n]$ and $j \in [1..m]$.

An example of information system is given by Table 1, where U = {c1, c2,
c3, c4, c5}, and A = {Age, Gender, Married, Country, Income, NumCars}.
Another example is given by Table 2.

Table 2: Product

PID  Name    Country    Category  Color  Price
p1   Bread   Australia  Staple    Black  1..9
p2   Diaper  China      Daily     White  1..9
p3   Pork    China      Meat      Red    1..9
p4   Beef    Australia  Meat      Red    10..19
p5   Beer    France     Alcohol   Black  10..19
p6   Wine    France     Alcohol   White  10..19

In an information system, any $A' \subseteq A$ induces an equivalence relation [33] [35]

\[
E_{A'} = \{(x, y) \in U \times U \mid \forall a \in A',\ a(x) = a(y)\}, \tag{1}
\]

and partitions $U$ into a number of disjoint subsets called blocks. The block
containing $x \in U$ is

\[
E_{A'}(x) = \{y \in U \mid \forall a \in A',\ a(y) = a(x)\}. \tag{2}
\]

From another viewpoint, a pair $C = (A', x)$ where $x \in U$ is called a concept.
The extension of the concept is

\[
ET(C) = ET(A', x) = E_{A'}(x), \tag{3}
\]

while the intension of the concept is the conjunction of respective attribute-value
pairs, i.e.,

\[
IT(C) = IT(A', x) = \bigwedge_{a \in A'} (a : a(x)). \tag{4}
\]

The support of the concept is the size of its extension divided by the size of the
universe, namely,

\[
\mathit{support}(C) = \mathit{support}(A', x) = \mathit{support}\Bigl(\bigwedge_{a \in A'} (a : a(x))\Bigr)
= \mathit{support}(E_{A'}(x)) = \frac{|E_{A'}(x)|}{|U|}. \tag{5}
\]
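
As an illustration of Equations (1)-(5), the following minimal sketch partitions the customers of Table 1 into blocks and computes the extension and the support of a concept. Only three attributes are transcribed, and the function names `blocks`, `extension` and `support` are hypothetical.

```python
from collections import defaultdict

# Rows of Table 1 (Customer), restricted to three attributes for brevity.
CUSTOMERS = {
    "c1": {"Age": "20..29", "Gender": "Male", "Country": "USA"},
    "c2": {"Age": "20..29", "Gender": "Female", "Country": "USA"},
    "c3": {"Age": "20..29", "Gender": "Male", "Country": "China"},
    "c4": {"Age": "30..39", "Gender": "Female", "Country": "Japan"},
    "c5": {"Age": "30..39", "Gender": "Male", "Country": "China"},
}


def blocks(table, attrs):
    """Partition the universe by equality on every attribute in attrs (Equation (1))."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[tuple(row[a] for a in attrs)].add(obj)
    return dict(groups)


def extension(table, attrs, x):
    """E_{A'}(x): the block containing x (Equations (2) and (3))."""
    return {y for y, row in table.items()
            if all(row[a] == table[x][a] for a in attrs)}


def support(table, attrs, x):
    """|E_{A'}(x)| / |U| (Equation (5))."""
    return len(extension(table, attrs, x)) / len(table)


men = extension(CUSTOMERS, ["Gender"], "c1")
young_men = extension(CUSTOMERS, ["Age", "Gender"], "c1")
print(sorted(men), support(CUSTOMERS, ["Gender"], "c1"))               # ['c1', 'c3', 'c5'] 0.6
print(sorted(young_men), support(CUSTOMERS, ["Age", "Gender"], "c1"))  # ['c1', 'c3'] 0.4
```

Adding an attribute to $A'$ can only shrink a block, which is why "young men" is a finer granule than "men."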

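Definition 2. Let $U = \{x_1, x_2, \ldots, x_n\}$ and $V = \{y_1, y_2, \ldots, y_k\}$ be two sets of
objects. $R \subseteq U \times V$ is a binary relation from $U$ to $V$. For $x \in U$ and $y \in V$,

\[
R(x) = \{y \in V \mid (x, y) \in R\}, \tag{6}
\]

\[
R^{-1}(y) = \{x \in U \mid (x, y) \in R\}. \tag{7}
\]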

A binary relation is more often stored in the database as a table with two
foreign keys, which saves storage. For the convenience of illustration,
here we represent it with an n × k boolean matrix. An example is given by
Table 3, where U is the set of customers as indicated by Table 1 and V is the
set of products as indicated by Table 2.

With Definitions 1 and 2, we propose the following definition.

Definition 3. A many-to-many entity-relationship system (MMER) is a 5-tuple
$ES = (U, A, V, B, R)$, where $(U, A)$ and $(V, B)$ are two information systems,
and $R \subseteq U \times V$ is a binary relation from $U$ to $V$.

An example of MMER is given by Tables 1, 2 and 3.
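
The two views of a binary relation, boolean matrix and set of pairs, and the operators of Equations (6) and (7) can be sketched as follows. The matrix is transcribed from Table 3 with empty cells read as 0; the names `image` and `preimage` are hypothetical.

```python
# Table 3 (Buys) as an n x k boolean matrix; rows are c1..c5, columns p1..p6.
CIDS = ["c1", "c2", "c3", "c4", "c5"]
PIDS = ["p1", "p2", "p3", "p4", "p5", "p6"]
BUYS = [
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 1, 1],
]

# The same relation stored as a set of (customer, product) pairs, a subset of U x V.
R = {(c, p) for i, c in enumerate(CIDS) for j, p in enumerate(PIDS) if BUYS[i][j]}


def image(relation, x):
    """R(x) of Equation (6): everything related to x."""
    return {v for (u, v) in relation if u == x}


def preimage(relation, y):
    """R^{-1}(y) of Equation (7): everything that y is related from."""
    return {u for (u, v) in relation if v == y}


print(sorted(image(R, "c1")))     # ['p1', 'p2', 'p4', 'p5']
print(sorted(preimage(R, "p5")))  # ['c1', 'c3', 'c4', 'c5']
```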

3.2. Granular association rule with three subtypes 

A granular association rule is an implication of the form

\[
(GR):\ \bigwedge_{a \in A'} (a : a(x)) \Rightarrow \bigwedge_{b \in B'} (b : b(y)), \tag{8}
\]

where $A' \subseteq A$ and $B' \subseteq B$.

Table 3: Buys

CID\PID  p1  p2  p3  p4  p5  p6
c1       1   1   0   1   1   0
c2       1   0   0   1   0   1
c3       0   1   1   0   1   1
c4       0   1   0   1   1   0
c5       1   0   1   1   1   1

According to Equation ([s]), the set of objects meeting the left-hand side of 
the granular association rule is 

\[
LH(GR) = E_{A'}(x), \tag{9}
\]

while the set of objects meeting the right-hand side of the granular association
rule is

\[
RH(GR) = E_{B'}(y). \tag{10}
\]

We define two measures to evaluate the generality of the granular association 
rule. The source coverage of GR is 

\[
\mathit{scoverage}(GR) = \frac{|LH(GR)|}{|U|}, \tag{11}
\]

while the target coverage of GR is

\[
\mathit{tcoverage}(GR) = \frac{|RH(GR)|}{|V|}. \tag{12}
\]

In most cases, rules with higher source coverage and target coverage tend to be 
more interesting. We present a granular association rule for discussion. 

\[
(\text{Gender: Male}) \Rightarrow (\text{Category: Alcohol}) \quad
[\mathit{scoverage} = 60\%,\ \mathit{tcoverage} = 33\%]. \tag{13}
\]
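
The two coverage values of Rule (13) can be reproduced from the Gender column of Table 1 and the Category column of Table 2. The sketch below is a minimal illustration; the dictionary names are hypothetical.

```python
# Gender column of Table 1 and Category column of Table 2.
GENDER = {"c1": "Male", "c2": "Female", "c3": "Male", "c4": "Female", "c5": "Male"}
CATEGORY = {"p1": "Staple", "p2": "Daily", "p3": "Meat",
            "p4": "Meat", "p5": "Alcohol", "p6": "Alcohol"}

lh = {c for c, g in GENDER.items() if g == "Male"}       # LH(GR), Equation (9)
rh = {p for p, k in CATEGORY.items() if k == "Alcohol"}  # RH(GR), Equation (10)

scoverage = len(lh) / len(GENDER)    # |LH(GR)| / |U|, Equation (11)
tcoverage = len(rh) / len(CATEGORY)  # |RH(GR)| / |V|, Equation (12)
print(f"scoverage = {scoverage:.0%}, tcoverage = {tcoverage:.0%}")  # scoverage = 60%, tcoverage = 33%
```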



A direct explanation of Rule (13) is "men like alcohol." However, this explanation
is ambiguous and the following questions may arise: Do all men like
alcohol? Do men like all kinds of alcohol? To avoid such ambiguity, more measures
of the rule are needed. We propose four different explanations of this
rule, as illustrated in Figure 1, and will discuss them from simple ones to more
general ones. Note that exemplary rules discussed in the following context may
not comply with the MMER given by Tables 1, 2 and 3.

3.2.1. Complete match 



The first explanation of Rule (13) is "all men like all alcohol," or equivalently,
"100% men like 100% alcohol." This can be formally expressed by the following
definition.

Figure 1: Four explanations of "men like alcohol" (ordered from special to general)
  Complete match rule: "All men like all kinds of alcohol."
  Left-hand side partial match rule: "40% men like all kinds of alcohol."
  Right-hand side partial match rule: "All men like at least 30% kinds of alcohol."
  Partial match rule: "40% men like at least 30% kinds of alcohol."



Definition 4. A granular association rule GR is called a complete match gran- 
ular association rule iff 

\[
LH(GR) \times RH(GR) \subseteq R. \tag{14}
\]

It is also called a complete match rule for brevity. We need to know the 
percentage of objects in U matching the rule. It is called the support of the rule 
and defined by 

\[
\mathit{support}_c(GR) = \mathit{scoverage}(GR) = \frac{|LH(GR)|}{|U|}, \tag{15}
\]

where the suffix c stands for complete. Although the support is equal to the 
source coverage, we still define this measure since in other subtypes they are 
different. Under this context, the rule 

\[
(\text{Gender: Male}) \Rightarrow (\text{Category: Alcohol}) \quad
[\mathit{scoverage} = 60\%,\ \mathit{tcoverage} = 33\%], \tag{16}
\]

will be read as "all men like all kinds of alcohol; 60% of all people are men; 33%
of all products are alcohol." Note that Rules (13) and (16) have the same form.
However, the explanation of Rule (16) causes no ambiguity under the context of
complete match.
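
Definition 4 can be checked directly on the data of Tables 1-3. In the minimal sketch below (hypothetical names, Buys stored as one product set per customer), the rule about men is not a complete match on this toy data because c1 does not buy p6, whereas the analogous rule about Chinese customers is.

```python
# Columns of Tables 1 and 2, and the Buys relation of Table 3 stored as R(x) per customer.
GENDER = {"c1": "Male", "c2": "Female", "c3": "Male", "c4": "Female", "c5": "Male"}
COUNTRY = {"c1": "USA", "c2": "USA", "c3": "China", "c4": "Japan", "c5": "China"}
CATEGORY = {"p1": "Staple", "p2": "Daily", "p3": "Meat",
            "p4": "Meat", "p5": "Alcohol", "p6": "Alcohol"}
BUYS = {
    "c1": {"p1", "p2", "p4", "p5"},
    "c2": {"p1", "p4", "p6"},
    "c3": {"p2", "p3", "p5", "p6"},
    "c4": {"p2", "p4", "p5"},
    "c5": {"p1", "p3", "p4", "p5", "p6"},
}


def is_complete_match(lh, rh, buys):
    """True iff LH(GR) x RH(GR) is a subset of R (Equation (14))."""
    return all(rh <= buys[x] for x in lh)


men = {c for c, g in GENDER.items() if g == "Male"}
chinese = {c for c, k in COUNTRY.items() if k == "China"}
alcohol = {p for p, k in CATEGORY.items() if k == "Alcohol"}

print(is_complete_match(men, alcohol, BUYS))      # False: c1 does not buy p6
print(is_complete_match(chinese, alcohol, BUYS))  # True: c3 and c5 buy both p5 and p6
```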

3.2.2. Left-hand side partial match 



The second explanation of Rule (13) is "some men like all alcohol," or equivalently,
"at least one man likes 100% alcohol." Because "some" appears on the
left-hand side, the rule is called "left-hand side partial match." Consequently,
we define a subtype of granular association rule as follows.

Definition 5. A granular association rule GR is called a left-hand side partial
match rule iff there exists $x \in LH(GR)$ such that

\[
R(x) \supseteq RH(GR). \tag{17}
\]

In applications, however, if very few men like all kinds of alcohol, this rule
is not very useful. We need to know the percentage of men that like alcohol.
The support of the rule is

\[
\mathit{support}_{lp}(GR) = \frac{|\{x \in LH(GR) \mid R(x) \supseteq RH(GR)\}|}{|U|}. \tag{18}
\]

In other words, only men that like all kinds of alcohol are counted. Moreover,
the source confidence of the rule is

\[
\mathit{sconfidence}_{lp}(GR) = \frac{|\{x \in LH(GR) \mid R(x) \supseteq RH(GR)\}|}{|LH(GR)|}. \tag{19}
\]

One may obtain the following rule

\[
(\text{Gender: Male}) \Rightarrow (\text{Category: Alcohol}) \quad
[\mathit{scoverage} = 60\%,\ \mathit{tcoverage} = 33\%,\ \mathit{sconfidence}_{lp} = 67\%], \tag{20}
\]

which is read as "67% men like all kinds of alcohol; 60% of customers are men;
33% of products are alcohol." We deliberately avoid the support measure in this
explanation; the reason will be discussed in the next subsection.
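
The measures of the left-hand side partial match rule can be computed as in the following minimal sketch, which uses the Buys relation as read from Table 3 and hypothetical names. On this toy data two of the three men (c3 and c5) buy both kinds of alcohol, which reproduces the 67% of Rule (20).

```python
GENDER = {"c1": "Male", "c2": "Female", "c3": "Male", "c4": "Female", "c5": "Male"}
CATEGORY = {"p1": "Staple", "p2": "Daily", "p3": "Meat",
            "p4": "Meat", "p5": "Alcohol", "p6": "Alcohol"}
BUYS = {
    "c1": {"p1", "p2", "p4", "p5"},
    "c2": {"p1", "p4", "p6"},
    "c3": {"p2", "p3", "p5", "p6"},
    "c4": {"p2", "p4", "p5"},
    "c5": {"p1", "p3", "p4", "p5", "p6"},
}


def lp_measures(lh, rh, buys, universe_size):
    """Return (support_lp, sconfidence_lp) of Equations (18) and (19)."""
    full = {x for x in lh if rh <= buys[x]}  # objects of LH related to all of RH
    return len(full) / universe_size, len(full) / len(lh)


men = {c for c, g in GENDER.items() if g == "Male"}
alcohol = {p for p, k in CATEGORY.items() if k == "Alcohol"}
support_lp, sconfidence_lp = lp_measures(men, alcohol, BUYS, len(GENDER))
print(f"support_lp = {support_lp:.0%}, sconfidence_lp = {sconfidence_lp:.0%}")
# support_lp = 40%, sconfidence_lp = 67%
```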

3.2.3. Right-hand side partial match 



The third explanation of Rule (13) is "all men like some kinds of alcohol,"
or equivalently, "100% men like at least one kind of alcohol." Because "some"
appears on the right-hand side, the rule is called "right-hand side partial match."
Consequently, we define a subtype of granular association rule as follows.

Definition 6. A granular association rule GR is called a right-hand side partial
match rule iff $\forall x \in LH(GR)$,

\[
R(x) \cap RH(GR) \neq \emptyset. \tag{21}
\]

Similar to the case of complete match, the support of the rule is equal to the 
source coverage. It is given by 



\[
\mathit{support}_{rp}(GR) = \mathit{scoverage}(GR) = \frac{|LH(GR)|}{|U|}. \tag{22}
\]



In the case of complete match and left-hand side partial match, bigger target 
coverage values indicate stronger rules. Unfortunately, in the case of right-hand 
side partial match, bigger target coverage values indicate weaker rules. Consider 
one extreme case as follows: "all customers like at least one kind of all products." 
The rule always holds, and both the source coverage and the target coverage of 
the rule are 100%, but the rule is totally useless. 

Therefore we need to know how many kinds of alcohol men like. Here we 
introduce a new measure called target confidence for this purpose. The target 
confidence of the right-hand side partial match rule is 



\[
\mathit{tconfidence}_{rp}(GR) = \min_{x \in LH(GR)} \frac{|R(x) \cap RH(GR)|}{|RH(GR)|}. \tag{23}
\]

With existing measures, we may obtain the following rule

\[
(\text{Gender: Male}) \Rightarrow (\text{Category: Alcohol}) \quad
[\mathit{scoverage} = 60\%,\ \mathit{tcoverage} = 33\%,\ \mathit{tconfidence}_{rp} = 50\%], \tag{24}
\]

which is read as "all men like at least 50% of alcohol; 60% of customers are
men; 33% of products are alcohol."
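
A minimal sketch of Definition 6 and Equation (23) follows, again with hypothetical names and the Buys relation as read from Table 3. The man with the fewest kinds of alcohol (c1, who buys only p5) determines the target confidence, which reproduces the 50% of Rule (24).

```python
GENDER = {"c1": "Male", "c2": "Female", "c3": "Male", "c4": "Female", "c5": "Male"}
CATEGORY = {"p1": "Staple", "p2": "Daily", "p3": "Meat",
            "p4": "Meat", "p5": "Alcohol", "p6": "Alcohol"}
BUYS = {
    "c1": {"p1", "p2", "p4", "p5"},
    "c2": {"p1", "p4", "p6"},
    "c3": {"p2", "p3", "p5", "p6"},
    "c4": {"p2", "p4", "p5"},
    "c5": {"p1", "p3", "p4", "p5", "p6"},
}


def is_rp_match(lh, rh, buys):
    """True iff every object of LH is related to at least one object of RH (Equation (21))."""
    return all(buys[x] & rh for x in lh)


def tconfidence_rp(lh, rh, buys):
    """Smallest fraction of RH reached by any object of LH (Equation (23))."""
    return min(len(buys[x] & rh) for x in lh) / len(rh)


men = {c for c, g in GENDER.items() if g == "Male"}
alcohol = {p for p, k in CATEGORY.items() if k == "Alcohol"}
print(is_rp_match(men, alcohol, BUYS))     # True
print(tconfidence_rp(men, alcohol, BUYS))  # 0.5
```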

3.2.4. Partial match



The fourth explanation of Rule (13) is "some men like some kinds of alcohol,"
or equivalently, "at least one man likes at least one kind of alcohol." Because
"some" appears on both sides, the rule will be simply called "partial match."
Consequently, we define this type of granular association rule as follows.

Definition 7. A granular association rule GR is called a partial match granular
association rule iff there exist $x \in LH(GR)$ and $y \in RH(GR)$ such that

\[
(x, y) \in R. \tag{25}
\]

It is also called a partial match rule for brevity. According to the definition, 
partial match is a general case of granular association rules. Therefore we cannot 
call it a subtype. 

There is a tradeoff between the source confidence and the target confidence 
of a rule. Consequently, neither value can be obtained directly from the rule. 
To compute any one of them, we need to specify the threshold of the other. Let 
tc be the target confidence threshold. The support of the partial match rule is 

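\[
\mathit{support}_p(GR, tc) = \frac{\left|\left\{x \in LH(GR) \,\middle|\, \frac{|R(x) \cap RH(GR)|}{|RH(GR)|} \ge tc\right\}\right|}{|U|}. \tag{26}
\]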

Here tc is a necessary parameter. For convenience, in some cases we may ignore 
it to keep the same form as others. The source confidence of the partial match 
rule is 

\[
\mathit{sconfidence}_p(GR, tc) = \frac{\left|\left\{x \in LH(GR) \,\middle|\, \frac{|R(x) \cap RH(GR)|}{|RH(GR)|} \ge tc\right\}\right|}{|LH(GR)|}. \tag{27}
\]

Let mc be the source confidence threshold, and let $K$ be the integer satisfying

\[
|\{x \in LH(GR) \mid |R(x) \cap RH(GR)| \ge K + 1\}|
< mc \times |LH(GR)|
\le |\{x \in LH(GR) \mid |R(x) \cap RH(GR)| \ge K\}|. \tag{28}
\]

The target confidence of the partial match rule is

\[
\mathit{tconfidence}_p(GR, mc) = \frac{K}{|RH(GR)|}. \tag{29}
\]

In fact, the computation of $K$ is non-trivial. First, for any $x \in LH(GR)$, we need
to compute $tc(x) = |R(x) \cap RH(GR)|$ and obtain an array of integers. Second,
we sort the array in descending order. Third, let $k = \lceil mc \times |LH(GR)| \rceil$; $K$ is
the $k$-th element in the array.

Table 4: Summary of source confidence and target confidence

Subtype \ Measure              Source confidence                                                                      Target confidence
Complete match                 100%                                                                                   100%
Left-hand side partial match   $|\{x \in LH(GR) \mid R(x) \supseteq RH(GR)\}| / |LH(GR)|$                             100%
Right-hand side partial match  100%                                                                                   $\min_{x \in LH(GR)} |R(x) \cap RH(GR)| / |RH(GR)|$
Partial match                  $|\{x \in LH(GR) \mid |R(x) \cap RH(GR)| / |RH(GR)| \ge tc\}| / |LH(GR)|$              $K / |RH(GR)|$

With existing measures, we may obtain the following rule 

\[
(\text{Gender: Male}) \Rightarrow (\text{Category: Alcohol}) \quad
[\mathit{scoverage} = 60\%,\ \mathit{tcoverage} = 33\%,\ \mathit{sconfidence}_p = 40\%,\ \mathit{tconfidence}_p = 30\%], \tag{30}
\]

which is read as "40% men like at least 30% of alcohol; 60% of customers are 
men; 33% of products are alcohol." 
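
The tradeoff between the two confidences, and the three-step computation of K, can be sketched as follows. This is a minimal illustration on the data of Tables 1-3, under two assumptions: the bracket in the definition of k is read as a ceiling (the reading consistent with Equation (28)), and thresholds are compared with "at least" rather than "strictly greater than". All names are hypothetical. Since the exemplary rules need not comply with the toy data, the values differ from those of Rule (30).

```python
import math

GENDER = {"c1": "Male", "c2": "Female", "c3": "Male", "c4": "Female", "c5": "Male"}
CATEGORY = {"p1": "Staple", "p2": "Daily", "p3": "Meat",
            "p4": "Meat", "p5": "Alcohol", "p6": "Alcohol"}
BUYS = {
    "c1": {"p1", "p2", "p4", "p5"},
    "c2": {"p1", "p4", "p6"},
    "c3": {"p2", "p3", "p5", "p6"},
    "c4": {"p2", "p4", "p5"},
    "c5": {"p1", "p3", "p4", "p5", "p6"},
}


def sconfidence_p(lh, rh, buys, tc):
    """Fraction of LH reaching at least tc of RH (Equation (27))."""
    return sum(len(buys[x] & rh) / len(rh) >= tc for x in lh) / len(lh)


def tconfidence_p(lh, rh, buys, mc):
    """K / |RH(GR)| of Equation (29), with K found by sorting."""
    # Steps 1 and 2: the array of tc(x) values, sorted in descending order.
    counts = sorted((len(buys[x] & rh) for x in lh), reverse=True)
    # Step 3: k = ceil(mc * |LH(GR)|); the epsilon guards against float noise
    # such as 0.6 * 5 == 3.0000000000000004, and max() keeps k >= 1.
    k = max(1, math.ceil(mc * len(counts) - 1e-9))
    return counts[k - 1] / len(rh)


men = {c for c, g in GENDER.items() if g == "Male"}
alcohol = {p for p, k in CATEGORY.items() if k == "Alcohol"}

print(sconfidence_p(men, alcohol, BUYS, tc=0.5))  # 1.0: every man buys at least one kind
print(tconfidence_p(men, alcohol, BUYS, mc=0.4))  # 1.0: at least 40% of men buy both kinds
print(tconfidence_p(men, alcohol, BUYS, mc=1.0))  # 0.5: all men buy at least one kind
```

Raising the source confidence threshold from 40% to 100% lowers the attainable target confidence from 100% to 50%, which is the tradeoff noted above.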



3.3. Discussion of measures 

We have presented five measures to evaluate the quality of granular association
rules. The source coverage is always $|LH(GR)| / |U|$, and the target coverage is
always $|RH(GR)| / |V|$. Table 4 summarizes the source confidence and the target
confidence.

From Equations (15), (18), (19), (22), (26) and (27) we know that for all four
cases, there is a direct connection among the support, source coverage and source
confidence of a rule:

\[
\mathit{support}_{*}(GR) = \mathit{scoverage}(GR) \times \mathit{sconfidence}_{*}(GR), \tag{31}
\]

where the suffix "*" could be replaced by c, lp, rp and p. Hence any one of
these three measures can be viewed as redundant. For convenience, in the following
context we will ignore the support measure.
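
As a worked check of Equation (31) in the left-hand side partial match case, take the rule "men like alcohol" on the MMER of Tables 1-3, assuming the Buys relation as read from Table 3. There $|U| = 5$, $LH(GR) = \{c1, c3, c5\}$, and the men who buy both kinds of alcohol are $c3$ and $c5$:

```latex
\[
\mathit{support}_{lp}(GR) = \frac{2}{5} = 40\%,
\qquad
\mathit{scoverage}(GR) \times \mathit{sconfidence}_{lp}(GR)
= \frac{3}{5} \times \frac{2}{3} = \frac{2}{5} = 40\%.
\]
```

The factor $|LH(GR)|$ cancels between the two fractions, which is why the identity holds for every subtype.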

3.4. Alternative definitions 

It is worth noting that Definitions 5 and 6 are asymmetric. A symmetric
definition of Definition 5 is

Definition 8. A granular association rule GR is called a type-2 right-hand side
partial match rule iff there exists $y \in RH(GR)$ such that

\[
R^{-1}(y) \supseteq LH(GR). \tag{32}
\]

With Definition 8, we have the following explanation of the rule: "at least one
kind of alcohol favors all men." Moreover, a symmetric definition of Definition
6 is

Definition 9. A granular association rule GR is called a type-2 left-hand side
partial match rule iff $\forall y \in RH(GR)$,

\[
R^{-1}(y) \cap LH(GR) \neq \emptyset. \tag{33}
\]

With Definition 9, we have the following explanation of the rule: "all kinds
of alcohol favor at least one man." Unfortunately, the subjects of these new rules
are concepts in $V$, and the relation under consideration is $R^{-1}$ instead of $R$.
Therefore the alternative definitions are not appropriate for our situation.

3.5. Comparison with existing types 

We now compare granular association rules with other types of association 
rules mentioned in Section 2.

1. Both boolean association rules and granular association rules deal with 
binary relations on two universes. For granular association rules, objects 
in either universe are described by a number of attributes. Therefore 
granular association rules reveal connections between object subsets in 
two universes, while boolean association rules reveal connections between 
objects in one universe. 

2. Both quantitative association rules and granular association rules deal 
with quantitative data. Moreover, the data sources are all described by 
attributes. Quantitative association rules involve only one universe, while 
granular association rules always involve two. 

3. Both multi-layer association rules and granular association rules describe 
objects with attributes. Multi-layer association rules have a predefined 
concept hierarchy with a tree structure, which does not exist for granular 
association rules. Moreover, multi-layer association rules involve only one
universe.

4. Extended boolean association rules may involve more than two data ta- 
bles. Similar to boolean association rules, objects are not described by 
attributes. Therefore they reveal connections between objects in different 
universes. 

5. Decentralized association rules involve at least two primary tables. From
this viewpoint, they are more general than granular association rules. As
mentioned earlier, this type of rule has a special requirement on the
database. Hence they are less useful than granular association rules.

6. Simple conjunctive association rules are quite flexible. They reveal the
connections between an object set and one of its subsets. Their motivation
is totally different from that of granular association rules.

7. IPL-based association rules consider one-to-many relationships, while gran- 
ular association rules consider many-to-many relationships. Moreover, 
C^D for IPL-based association rules does not require much user specification,
while granular association rules require four thresholds for measures.

8. Granular association rules have the same form as separated counting
association rules. The number of objects for a rule is counted locally in one
universe, therefore the joining of tables is unnecessary. One important
difference between the two types is that granular association rules have
more measures, therefore they are semantically richer.

There is still another closely related technique called collaborative
recommendation [4] or collaborative filtering [15]. This technique also considers
many-to-many relationships, with interesting applications such as product
recommendation and web page recommendation. Compared with granular association
rules, this technique focuses more on particular applications. Therefore, from
one viewpoint, granular association rules can be employed in collaborative
recommendation. From another viewpoint, collaborative recommendation can be
viewed as a local approach, since we always recommend something to a user. In
contrast, granular association rules can be viewed as a global approach which
outputs only strong rules.



4. Granular association rule mining algorithms 

In this section, we first define the granular association rule mining problem. 
Then we propose a sandwich algorithm with four rule checking approaches, one 
for partial matching rules and three for subtypes. Naturally, the one for partial 
matching rules is also valid for three subtypes. Then two more algorithms are 
designed for the complete match subtype. Time complexities of all algorithms 
are analyzed. 

4.1. The granular association rule mining problem
We now define the problem as follows. 

Problem 10. The granular association rule mining problem. 

Input: An ES = (U, A, V, B, R), a minimal source coverage threshold ms,
a minimal target coverage threshold mt, a minimal source confidence threshold
mc, and a minimal target confidence threshold tc.

Output: All granular association rules satisfying $scoverage(GR) \ge ms$,
$tcoverage(GR) \ge mt$, $sconfidence_p(GR) \ge mc$, and $tconfidence_p(GR) \ge tc$.

4.2. A sandwich algorithm

A straightforward algorithm for Problem 10 is given by Algorithm 1. It
essentially has three steps.

Step 1. Search in (U, A) all concepts meeting the minimal source coverage
threshold ms. This step corresponds to Line 1 of the algorithm, where SC
stands for source concept.

Step 2. Search in (V, B) all concepts meeting the minimal target coverage
threshold mt. This step corresponds to Line 2 of the algorithm, where TC
stands for target concept.

Step 3. Check all possible rules regarding SC and TC, and output valid
ones. This step corresponds to Lines 3 through 10 of the algorithm.

Algorithm 1 A sandwich algorithm for partial match
Input: ES = (U, A, V, B, R), ms, mt, mc, tc.
Output: All partial match rules satisfying given constraints.
Method: partial-match-sandwich
1: $SC(ms) = \{(A', x) \in 2^A \times U \mid |E_{A'}(x)| / |U| \ge ms\}$;
2: $TC(mt) = \{(B', y) \in 2^B \times V \mid |E_{B'}(y)| / |V| \ge mt\}$;
3: for each $C \in SC(ms)$ do
4:   for each $C' \in TC(mt)$ do
5:     $GR = (IT(C) \Rightarrow IT(C'))$;
6:     if $sconfidence_p(GR, tc) \ge mc$ then
7:       output rule GR;
8:     end if
9:   end for
10: end for

Since this algorithm starts from both ends of the association rule and
proceeds to the middle, it is called the "sandwich" algorithm. Note that the check
of the condition $sconfidence_p(GR, tc) \ge mc$ in Line 6 is non-trivial. It
indicates that both the source confidence threshold and the target confidence
threshold should be met.
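The three steps can be sketched in Python as follows. This is a brute-force illustration under stated assumptions, not the implementation evaluated in the experiments: concepts are enumerated over all attribute subsets without Apriori-style pruning, an information system is a hypothetical dict from objects to attribute values, R maps each source object to its set of related target objects, and `sconfidence` encodes an assumed reading of $sconfidence_p(GR, tc)$ as the fraction of left-hand side objects related to at least a fraction tc of the right-hand side objects.

```python
from itertools import combinations


def concepts(table, min_cov):
    """Map each intension to its extension, keeping coverage >= min_cov."""
    attrs = sorted(next(iter(table.values())))
    found = {}
    for k in range(1, len(attrs) + 1):
        for subset in combinations(attrs, k):
            for row in table.values():
                intension = tuple((a, row[a]) for a in subset)
                if intension not in found:
                    found[intension] = frozenset(
                        o for o, r in table.items()
                        if all(r[a] == v for a, v in intension))
    return {i: e for i, e in found.items() if len(e) / len(table) >= min_cov}


def sconfidence(R, lh, rh, tc):
    """Assumed reading of sconfidence_p(GR, tc)."""
    ok = [x for x in lh if len(R.get(x, set()) & rh) / len(rh) >= tc]
    return len(ok) / len(lh)


def sandwich(users, items, R, ms, mt, mc, tc):
    sc = concepts(users, ms)   # Step 1: source concepts
    tcs = concepts(items, mt)  # Step 2: target concepts
    return [(c, c2)            # Step 3: check every pair
            for c, lh in sc.items() for c2, rh in tcs.items()
            if sconfidence(R, lh, rh, tc) >= mc]


# Toy data (hypothetical).
users = {"u1": {"gender": "m", "country": "cn"},
         "u2": {"gender": "m", "country": "us"},
         "u3": {"gender": "f", "country": "cn"}}
items = {"v1": {"kind": "alcohol"}, "v2": {"kind": "alcohol"},
         "v3": {"kind": "tobacco"}}
R = {"u1": {"v1", "v2"}, "u2": {"v1"}, "u3": {"v3"}}

rules = sandwich(users, items, R, ms=0.5, mt=0.5, mc=1.0, tc=0.5)
print(rules)
```

On this toy data the only surviving rule links the concept (gender: m) to the concept (kind: alcohol), since every man likes at least half of the alcohol items.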

Now we discuss the algorithm in more detail. The Apriori algorithm [31 136) 
and the FP-growth algorithm [T^ can be employed in Lines 1 and 2. These 
algorithms are based on the Apriori property, which is stated as "every subset 
of a frequent itemset must also be a frequent itemset" [3]. Under our context, 
the Apriori property can be restated as follows. 

Property 11. Let $A'' \subseteq A' \subseteq A$ and $x \in U$. Then

$|E_{A'}(x)| \le |E_{A''}(x)|$. (34)

Naturally, for the three subtypes, the condition expressed by Line 6 of the
algorithm might be replaced by simpler ones. We explain the case for each
subtype.

4.2.1. Complete match

If mc = tc = 100%, we are essentially looking for complete match rules. The
condition can be replaced by

$ET(C) \times ET(C') \subseteq R$. (35)

Moreover, in this case some checks are redundant. We have the following
property.

Property 12. Let $A'' \subseteq A' \subseteq A$, $x \in U$, $B'' \subseteq B' \subseteq B$, and $y \in V$. If

$ET(A'', x) \times ET(B'', y) \subseteq R$,

then

$ET(A', x) \times ET(B', y) \subseteq R$. (36)

Proof. Because $A'' \subseteq A'$, $ET(A', x) \subseteq ET(A'', x)$. Similarly,
$ET(B', y) \subseteq ET(B'', y)$. Therefore
$ET(A', x) \times ET(B', y) \subseteq ET(A'', x) \times ET(B'', y)$, and
the property holds.

Property 12 is essentially another form of the Apriori property. Its
contrapositive can be used to remove unnecessary checks of rules. Note
that changes can be made on both sides of the rule. For example, if rule "all
Chinese men like all kinds of France alcohol" does not hold, then rule "all men
like all kinds of alcohol" never holds.
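The pruning suggested by Property 12 can be sketched as follows. This is an illustrative Python fragment with hypothetical names: R maps each source object to its set of related target objects, and `failed` records the extensions of finer rules that have already been found not to hold.

```python
def complete_match(R, lh, rh):
    """ET(C) x ET(C') is a subset of R: every x in lh is related to all of rh."""
    return all(rh <= R.get(x, set()) for x in lh)


def can_skip(lh, rh, failed):
    """True if an already-failed rule is finer than (lh, rh) on both sides.

    By the contrapositive of Property 12, such a coarser rule cannot hold.
    """
    return any(f_lh <= lh and f_rh <= rh for f_lh, f_rh in failed)


# Toy data (hypothetical).
R = {"u1": {"v1"}, "u2": {"v1", "v2"}}

fine = (frozenset({"u1"}), frozenset({"v2"}))
coarse = (frozenset({"u1", "u2"}), frozenset({"v1", "v2"}))

failed = []
if not complete_match(R, *fine):
    failed.append(fine)

print(can_skip(*coarse, failed))   # the coarser rule need not be checked
print(complete_match(R, *coarse))  # checking it anyway confirms it fails
```

The coarser rule is rejected without touching R, which is the saving the property is meant to provide.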



4.2.2. Left-hand side partial match

If tc = 100%, we are essentially looking for left-hand side partial match rules.
The condition can be replaced by

$\frac{|\{x \in LH(GR) \mid R(x) \supseteq RH(GR)\}|}{|LH(GR)|} \ge mc$. (37)
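The left-hand side partial match condition amounts to counting the source objects that are related to every target object. A minimal Python sketch, with hypothetical names and toy data, where R maps each source object to its set of related target objects:

```python
def lhs_partial_match(R, lh, rh, mc):
    """Check whether the fraction of x in lh with R(x) containing rh is >= mc."""
    matched = sum(1 for x in lh if rh <= R.get(x, set()))
    return matched / len(lh) >= mc


# Toy data (hypothetical): two of the three source objects like both items.
R = {"u1": {"v1", "v2"}, "u2": {"v1"}, "u3": {"v1", "v2"}}
lh = {"u1", "u2", "u3"}
rh = {"v1", "v2"}

print(lhs_partial_match(R, lh, rh, mc=0.6))
print(lhs_partial_match(R, lh, rh, mc=0.7))
```

Each object is tested with a single subset check rather than a count over the whole right-hand side, which is one reason this subtype is cheaper to check than the general case.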

4.2.3. Right-hand side partial match

If mc = 100%, we are essentially looking for right-hand side partial match
rules. The condition can be replaced by

$\min_{x \in ET(C)} \frac{|R(x) \cap ET(C')|}{|ET(C')|} \ge tc$. (38)

Similar to the case of complete match, we would like to remove unnecessary
checks of rules. In fact, we have the following property.

Property 13. Let $A'' \subseteq A' \subseteq A$, $x \in U$, $B' \subseteq B$, and $y \in V$. Then

$\min_{x' \in ET(A', x)} \frac{|R(x') \cap ET(B', y)|}{|ET(B', y)|} \ge \min_{x' \in ET(A'', x)} \frac{|R(x') \cap ET(B', y)|}{|ET(B', y)|}$. (39)



Proof. Because $A'' \subseteq A'$, $ET(A'', x) \supseteq ET(A', x)$. The minimum
over a larger set cannot be greater, hence Equation (39) holds.

Property 13 indicates one approach to removing unnecessary checks concerning
the left side of the rule. Unlike Property 12, in this case the change cannot
be made on both sides. For example, if rule "all Chinese men like at least 30%
kinds of France alcohol" does not hold, then "all men like at least 30% kinds of
France alcohol" never holds. However, "all Chinese men like at least 30% kinds
of alcohol" may hold.
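The asymmetry described above can be reproduced on toy data. The following Python sketch uses hypothetical names; `tconfidence_min` computes the left-hand side of Equation (38), and R maps each person to the set of products they like.

```python
def tconfidence_min(R, lh, rh):
    """min over x in lh of |R(x) & rh| / |rh|, as in Equation (38)."""
    return min(len(R.get(x, set()) & rh) / len(rh) for x in lh)


# Toy data (hypothetical): f* are French products, b* are other alcohol.
R = {"cn1": {"f1", "b1"}, "cn2": {"b1", "b2"}, "us1": set()}
chinese_men = {"cn1", "cn2"}
men = {"cn1", "cn2", "us1"}
french = {"f1", "f2"}
alcohol = {"f1", "f2", "b1", "b2"}

tc = 0.3
print(tconfidence_min(R, chinese_men, french) >= tc)   # finer rule fails
print(tconfidence_min(R, men, french) >= tc)           # coarser left side fails too
print(tconfidence_min(R, chinese_men, alcohol) >= tc)  # coarser right side may hold
```

Enlarging the left-hand side can only lower the minimum, whereas enlarging the right-hand side changes both numerator and denominator, so no monotonicity is available there.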

Now we analyze the time complexity of the algorithm. For the partial match
subtype, from Equation (26) we know the complexity of Line 6 is

$O(|ET(C)| \times |ET(C')|) = O(|U| \times |V|)$. (40)

According to the for loops, the time complexity of Algorithm 1 is

$O(|SC(ms)| \times |TC(mt)| \times |U| \times |V|)$. (41)

Algorithm 2 A forward algorithm
Input: ES = (U, A, V, B, R), ms, mt.
Output: All complete match granular association rules satisfying given
constraints.
Method: complete-match-rules-forward
1: $SC(ms) = \{(A', x) \in 2^A \times U \mid |E_{A'}(x)| / |U| \ge ms\}$;
2: $TC(mt) = \{(B', y) \in 2^B \times V \mid |E_{B'}(y)| / |V| \ge mt\}$;
3: for each $C \in SC(ms)$ do
4:   $X = ET(C)$;
5:   $Y = \underline{R}(X)$;
6:   for each $C' \in TC(mt)$ do
7:     if ($ET(C') \subseteq Y$) then
8:       output rule $IT(C) \Rightarrow IT(C')$;
9:     end if
10:  end for
11: end for

For the complete match subtype, suppose that both $ET(C)$ and $ET(C')$
are stored in one-dimensional arrays of positive integers. Each element in the
array indicates the inclusion of one particular object in the concept. For
example, [1, 4, 8] indicates $\{x_1, x_4, x_8\}$. Suppose further that R is stored
in a $|U| \times |V|$ boolean array. The worst-case time complexity of checking
$ET(C) \times ET(C') \subseteq R$ is the same as that of partial match, as
indicated by Equation (40). Consequently, the time complexity for the complete
match subtype is also given by Equation (41). However, checking
$ET(C) \times ET(C') \subseteq R$ ends immediately once a violation of
the relationship is found. Compared with the check of
$sconfidence_p(GR, tc) \ge mc$, it is less time consuming.
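The storage scheme and the early termination described above can be sketched in Python as follows. The function name and the toy matrix are hypothetical; the returned counter is only there to make the effect of early termination visible.

```python
def complete_match(R, xs, ys):
    """Check ET(C) x ET(C') against R, stopping at the first violation.

    R is a |U| x |V| boolean matrix (list of lists); xs and ys are arrays of
    1-based object indices, e.g. [1, 4, 8] for {x_1, x_4, x_8}.
    Returns (holds, number_of_cells_inspected).
    """
    checks = 0
    for i in xs:
        for j in ys:
            checks += 1
            if not R[i - 1][j - 1]:
                return False, checks
    return True, checks


# Toy relation (hypothetical) with |U| = |V| = 3.
R = [[True, True, False],
     [False, True, True],
     [True, True, True]]

print(complete_match(R, [1, 3], [1, 2]))     # all four cells are inspected
print(complete_match(R, [1, 2, 3], [1, 2]))  # stops at the first False cell
```

When a rule fails, the number of inspected cells can be far below $|ET(C)| \times |ET(C')|$, which is consistent with the observation that the worst-case bound overstates the actual run time.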

Similarly, for the other two subtypes, the time complexities are all given
by Equation (41). The run time for different subtypes will, however, be very
different in applications. This will be shown through experiments in Section 5.

4.3. Two algorithms for the complete match subtype

The time complexity of the sandwich algorithm is quite high. We now propose
two alternative approaches for the complete match subtype, and show
that their time complexities are lower than that of Algorithm 1.

4.3.1. A forward algorithm

The first alternative approach is called the "forward" approach. It starts
from the left-hand side of the rule and proceeds to the right-hand side. The
algorithm is listed in Algorithm 2. It essentially has four steps.

Steps 1 and 2. They are the same as in Algorithm 1.

Step 3. For each concept obtained in Step 1, construct a block in V according
to R. This step corresponds to Line 4 of the algorithm. The function
ET has been defined earlier. We introduce a new concept regarding
Line 5.

Definition 14. Let U and V be two universes, $R \subseteq U \times V$ be a binary
relation, and $X \subseteq U$. The lower approximation of X with respect to R is

$\underline{R}(X) = \{y \in V \mid R^{-1}(y) \supseteq X\}$. (42)

In our example, $\underline{R}(X)$ are all products that favor all people in X. The
concept "lower approximation" comes from rough sets [33]. However, we consider
two universes here instead of only one.

Step 4. Check possible rules regarding C and Y, and output all valid ones. This
step corresponds to Lines 6 through 10 of the algorithm. In Line 7, since $ET(C')$
and Y could be stored in sorted arrays, the complexity of checking
$ET(C') \subseteq Y$ is

$O(|ET(C')| + |Y|) = O(|V|)$. (43)

According to the for loops, the time complexity of Algorithm 2 is

$O(|SC(ms)| \times |TC(mt)| \times |V|)$, (44)

which is lower than that of Algorithm 1.
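A compact Python sketch of the forward approach is given below. It assumes that the source and target concepts of Steps 1 and 2 are already available as hypothetical dicts from a concept label to its extension, and it uses Python sets for the subset test instead of the sorted arrays mentioned above.

```python
def lower_approx(R, X, V):
    """Definition 14: all y in V that are related to every object in X."""
    return {y for y in V if all(y in R.get(x, set()) for x in X)}


def forward(source_concepts, target_concepts, R, V):
    """Mine complete match rules, starting from the left-hand side."""
    rules = []
    for c, X in source_concepts.items():
        Y = lower_approx(R, X, V)  # computed once per source concept
        for c2, ext in target_concepts.items():
            if ext <= Y:           # ET(C') is a subset of Y
                rules.append((c, c2))
    return rules


# Toy data (hypothetical).
V = {"v1", "v2", "v3"}
R = {"u1": {"v1", "v2"}, "u2": {"v1"}, "u3": {"v1", "v2"}}
source_concepts = {"men": {"u1", "u2"}, "women": {"u3"}}
target_concepts = {"alcohol": {"v1", "v2"}, "beer": {"v1"}}

print(forward(source_concepts, target_concepts, R, V))
```

The lower approximation is shared by all target concepts tested against the same source concept, so the inner loop no longer touches R.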

4.4. The backward algorithm

The backward algorithm, which is a dual of Algorithm 2, is listed in
Algorithm 3. It starts from the right-hand side of the rule and proceeds to the
left-hand side. It is symmetric with respect to Algorithm 2. According to
Definition 14, $\underline{R^{-1}}(Y) = \{x \in U \mid R(x) \supseteq Y\}$. In our
example, $\underline{R^{-1}}(Y)$ are all people buying all products in Y. Similar
to the analysis of Algorithm 2, the time complexity of Algorithm 3 is

$O(|SC(ms)| \times |TC(mt)| \times |U|)$. (45)

Now one question arises: which algorithm performs better? According to
Equations (44) and (45), we should choose the forward algorithm if $|U| > |V|$,
and the backward algorithm otherwise. This issue will be discussed further
through experimentation in Section 5.
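The backward approach can be sketched in the same style. The fragment below is illustrative only: concepts are given as hypothetical dicts from a label to an extension, and the result is cross-checked against a direct test of $ET(C) \times ET(C') \subseteq R$ for every pair.

```python
def backward(source_concepts, target_concepts, R, U):
    """Mine complete match rules, starting from the right-hand side."""
    rules = []
    for c2, Y in target_concepts.items():
        # Lower approximation of Y with respect to R^{-1}.
        X = {x for x in U if Y <= R.get(x, set())}
        for c, ext in source_concepts.items():
            if ext <= X:  # ET(C) is a subset of X
                rules.append((c, c2))
    return rules


def brute_force(source_concepts, target_concepts, R):
    """Reference check of every pair of concepts against R."""
    return [(c, c2)
            for c, lh in source_concepts.items()
            for c2, rh in target_concepts.items()
            if all(rh <= R.get(x, set()) for x in lh)]


# Toy data (hypothetical).
U = {"u1", "u2", "u3"}
R = {"u1": {"v1", "v2"}, "u2": {"v1"}, "u3": {"v1", "v2"}}
source_concepts = {"men": {"u1", "u2"}, "women": {"u3"}}
target_concepts = {"alcohol": {"v1", "v2"}, "beer": {"v1"}}

found = backward(source_concepts, target_concepts, R, U)
print(sorted(found) == sorted(brute_force(source_concepts, target_concepts, R)))
```

Here the inner subset test works on subsets of U rather than V, which is where the factor $|U|$ in Equation (45) comes from.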



5. Experiments on a real world dataset 

In this section, we try to answer the following questions through
experimentation.

1. Do granular association rules make sense in real-world applications?

2. Do dedicated approaches for different subtypes improve the performance 
of the sandwich algorithm? 

3. Do the forward and backward algorithms outperform the sandwich
algorithm significantly?

Algorithm 3 The backward algorithm
Input: ES = (U, A, V, B, R), ms, mt.
Output: All complete match granular association rules satisfying given
constraints.
Method: complete-match-rules-backward
1: $SC(ms) = \{(A', x) \in 2^A \times U \mid |E_{A'}(x)| / |U| \ge ms\}$;
2: $TC(mt) = \{(B', y) \in 2^B \times V \mid |E_{B'}(y)| / |V| \ge mt\}$;
3: for each $C' \in TC(mt)$ do
4:   $Y = ET(C')$;
5:   $X = \underline{R^{-1}}(Y)$;
6:   for each $C \in SC(ms)$ do
7:     if ($ET(C) \subseteq X$) then
8:       output rule $IT(C) \Rightarrow IT(C')$;
9:     end if
10:  end for
11: end for



5.1. Dataset 

We obtained a real-world dataset from Zhangzhou Normal University. The
database schema is as follows.

• Student (studentID, name, gender, birth-year, politics-status, grade,
department, nationality, length-of-schooling)

• Course (courseID, credit, class-hours, availability, department)

• Selects (studentID, courseID)

We collected data during the semester between 2011 and 2012. There are 145 
general education courses in the university, and 9,654 students took part in 
course selection. 

5.2. Results 

We undertake three sets of experiments to answer the questions raised at 
the beginning of the section one by one. 

5.2.1. The meaningfulness of rules 

We obtain some strong rules using the sandwich algorithm. Let ms = 0.06,
mt = 0.06, mc = 0.18, and tc = 0.11. Forty granular association rules are
obtained, and four of them are listed below.

(Rule 1) (department: economics)
$\Rightarrow$ (department: human-resource)
(Rule 2) (nationality: han) $\wedge$ (department: economics)
$\Rightarrow$ (credit: 1) $\wedge$ (department: human-resource)
(Rule 3) (politics: league-member) $\wedge$ (nationality: han) $\wedge$ (department: economics) $\wedge$ (length-of-schooling: 4)
$\Rightarrow$ (credit: 1) $\wedge$ (department: human-resource)
(Rule 4) (birth-year: 1993) $\wedge$ (nationality: han) $\wedge$ (length-of-schooling: 4) $\wedge$ (grade: 2011)
$\Rightarrow$ (credit: 1) $\wedge$ (department: human-resource)

All rules are quite meaningful, and they might be employed for course
recommendation directly. Rule 1 indicates that students in the economics
department like courses offered by the human-resource department. We observe
that Rule 3 is finer than Rule 2, which is in turn finer than Rule 1. It happens
that all three rules hold under the given setting. Rule 4 is not comparable with
the other three rules in terms of granulation.

5.2.2. The performance of dedicated rule checking approaches 

We study the performance of the sandwich algorithm for different subtypes.
We focus on Step 3 of the algorithm, since it is more time consuming than Steps
1 and 2 for large datasets, and it differs among subtypes. The algorithm chooses
the appropriate subtype according to the mc and tc settings, as indicated in
Section 4.2. We deliberately set mc and/or tc to 0.95 so that different subtypes
are chosen, while the rule set is the same as in the case of 1.

The results are listed in Table 5, where a basic operation refers to comparison,
addition, etc. Here we observe that the dedicated approaches for the three
subtypes are significantly faster than the one for the general case. For example,
when ms = mt = 0.01, the approaches for the complete match subtype, the left-hand
side partial match subtype, and the right-hand side partial match subtype are
866, 128, and 174 times faster, respectively, than the approach for the general
partial match case. Generally, the approaches for the three subtypes are 2-3
orders of magnitude faster than the one for the general case.

5.2.3. The performance of different algorithms 

We compare the sandwich algorithm, the forward algorithm and the backward
algorithm for the complete match subtype. Only the numbers of basic
operations are compared, as depicted in Figure 2. Here we set ms = mt and let
them range from 0.007 to 0.01. We observe that the forward algorithm generally
performs the best. It is about one time faster than the sandwich algorithm,
that is, roughly twice as fast.

Note that the speedup is not as significant as indicated by Equations (41),
(44) and (45). Nor does the backward algorithm outperform the forward
algorithm when $|U| < |V|$. One important reason is that rule checking terminates
once certain conditions are met, which introduces much uncertainty to the
run time. Consequently, the time complexities are for reference only, and the
run time depends more on the characteristics of the data. This might be a common
phenomenon for data mining algorithms.

6. Conclusions and further works 

In this paper, we have proposed granular association rules to reveal
many-to-many relationships in relational databases. They have wide applications
such as collaborative recommendation [4] and collaborative filtering [15]. Four
measures have been defined to evaluate the quality of these rules. Therefore the
new type of rule is semantically richer than existing ones. We also proposed
three algorithms for association rule mining, and compared algorithm efficiency
through experimentation.

Table 5: Run time of Step 3 for different settings

Figure 2: Basic operations of three algorithms

The following research topics deserve further investigation:

1. Different types of data for object description. In this work we considered
only symbolic data for describing objects. It is necessary to consider
numeric data, heterogeneous data [18], interval-valued data [7] and data
with missing values [35]. There are some neighborhood systems concerning
distance [18] or error ranges [30] to formalize these data. Respective
approaches (see, e.g., [7, 18, 30]) can also be employed for these issues.
Moreover, there might be test costs while obtaining data [29, 28]. Hence
we should also consider cost data in certain applications.

2. Different granular association rule mining problems. In the problem
definition of this paper, four thresholds are needed as the input. We may
provide other means of parameter setting for non-expert users. For example,
we may mine top-K interesting rules, where K is easy to specify. We
may need to remove redundant rules [32, 34] and common sense rules to
avoid pattern explosion [37].

3. Efficient algorithms for these problems. As discussed in Section 4, the
time complexities of the proposed algorithms are rather high. For datasets
with hundreds of thousands of objects, these algorithms may take too
much time. Therefore we need to improve the speed of the algorithms
dramatically by taking full advantage of the Apriori property indicated
in Section 4. The rough sets approach to association rule mining [31]
may also be employed for this purpose. Moreover, since our algorithms
are essentially exhaustive ones, it may even be necessary to design
heuristic algorithms [37] for large datasets.

4. Theoretical foundations of these problems and algorithms. The forward
and the backward algorithms make use of concept approximation from the
viewpoint of rough sets [33], especially the one for two universes [21, 38].
These two algorithms consider only complete match rules, therefore
the classical rough set model is employed. For the general case and the two
other subtypes, we may need variable precision rough sets [45] or decision
theoretical rough sets [201 HSl UHl SI]. There are at least two types of
coverings in this scenario. The first type of covering is induced by binary
relations. Given an element in one universe, the binary relation always
induces a subset in the other. In this way, from all elements in one universe,
a covering of the other universe is induced. The second type of covering is
induced by granular association rules. Either side of a rule corresponds to a
concept, which describes a covering block. Covering-based rough sets
[27, 43, 44] are a natural approach for these issues.

To sum up, granular association rule mining is a challenging problem due
to pattern explosion [37]. It may benefit from rough sets, especially variable
precision rough sets [IS] and covering-based rough sets [33]. Therefore this work
has opened a new research trend concerning granular computing, association
rule mining, and rough sets.